Gene Module Identification from Microarray Data Using Nonnegative Independent Component Analysis

Genes mostly interact with each other to form transcriptional modules for performing single or multiple functions. It is important to unravel such transcriptional modules and to determine how disturbances in them may lead to disease. Here, we propose a non-negative independent component analysis (nICA) approach for transcriptional module discovery. nICA method utilizes the non-negativity constraint to enforce the independence of biological processes within the participated genes. In such, nICA decomposes the observed gene expression into positive independent components, which fits better to the reality of corresponding putative biological processes. In conjunction with nICA modeling, visual statistical data analyzer (VISDA) is applied to group genes into modules in latent variable space. We demonstrate the usefulness of the approach through the identification of composite modules from yeast data and the discovery of pathway modules in muscle regeneration.


Introduction
Genes often interact with each other to carry out cellular activities (Segal et al. 2004). The set of genes tightly regulated in a specifi c cellular process can be considered as a process-specifi c transcriptional module (Hughes et al. 2000;Brunet et al. 2004;Segal et al. 2004). It is important to identify such modules for understanding biological events associated with different experimental conditions, which may further help identify gene expression signatures associated with diseases.
Various methods have been proposed to identify gene transcriptional modules from microarray data. Several clustering techniques, such as hierarchical clustering (Spellman et al. 1998), k-means (Saeed Tavazoie et al. 1999) and self-organizing maps (Tamayo et al. 1999), are in common use for identifying meaningful subgroup genes exhibiting similar expression patterns. These approaches played a key role in gaining insights into the biological mechanisms associated with different physiological states. However, the clustering approaches are not well tuned for regulatory module identifi cation due to that: (1) a set of co-regulated genes may only co-express in a subset of experimental conditions, and (2) clustering the genes into one and only one group may also mask the interrelationships between genes that are assigned to different clusters but show local similarities in their expression patterns. Thus, biologists are more interested in fi nding the hidden regulatory patterns behind gene expression patterns, which strengthens the biological relevance of the grouped genes, i.e. the genes are co-regulated to form transcriptional modules.
Recently, matrix decomposition methods have been introduced to uncover transcriptional modules from microarray data, including independent component analysis (ICA) (Liebermeister, 2002;Lee and Batzoglou, 2003;Frigyesi et al. 2006) and nonnegative matrix factorizations (NMF) (Carmona-Saez et al. 2006;Lee and Seung, 1999;Kim and Tidor, 2003;Wang et al. 2006). These methods treat microarray data as a mixture of unknown factors (or components) that may correspond to specifi c biological processes. Specifi cally, the level of any given mRNA expression is modeled as the net sum of a complex superposition of cooperating and/or counteracting biological processes. ICA is a statistical method for revealing independent hidden factors that underlie sets of random variables or observations. In the context of microarray data, these statistically independent hidden factors may correspond to putative biological processes or transcriptional modules. It has been shown that the clusters found by ICA are directly associated with biological processes with common regulatory mechanisms (Lee and Batzoglou, 2003;Frigyesi et al. 2006). However, it is problematic to directly apply ICA to gene expression data due to its strong assumption of the independence of hidden variables in whole gene population. Biologically, it is more plausible to assume that the independence holds only for those genes that actively participating in biological processes. Therefore, we need to make further assumptions to constrain the ICA model for gene module identifi cation.
In this paper, we propose to use non-negative ICA (nICA) for gene module identifi cation (Ting et al. 2006;Oja and Plumbley, 2004;Plumbley, 2001). nICA exploits the non-negativity constraint to enforce the independence of biological processes within the participated genes. In principle, nICA can be thought as a projection method with which the expression levels (or ratios) are projected onto some new non-negative components with least statistical dependence. We believe that nICA provides a better model of gene expression data than ICA does, hence, more appealing for gene module identifi cation.
In the proposed approach, we fi rst perform input sample selection to improve the quality of separation of the components. We then develop a stability analysis procedure to determine the number of non-negative independent components. We further implement a learning algorithm of nICA with the non-negativity constraint for hidden component estimation. Finally, we use visual statistical data analyzer (VISDA), a data visualization and clustering tool (Wang et al. 2007b;Wang et al. 2003), to group the genes into modules in latent variable space. We demonstrate the effectiveness of the proposed approach for gene module identification using yeast and muscle regeneration datasets. The biological relevance of the identifi ed gene modules is validated by functional annotation analysis. Compared with conventional ICA-based approach, the proposed approach appears to have improved performance in fi nding biologically meaningful transcriptional modules.

Methods
The fl owchart of the proposed approach is outlined in Figure 1. The algorithm consists of the following components -input sample selection, stabilitybased dimension estimation, learning algorithm of nICA, and gene clustering by VISDA. We provide a detailed description of each component as follows.

Input sample selection
Input sample selection aims at selecting the most informative samples among the available samples for nICA decomposition. Without the proper selection of input samples, some computational problems such as increased computational complexity and degraded convergence may arise. Even worse, some samples may cause singularity problems for nICA decomposition. Principle components analysis (PCA), a variance based dimension reduction technique, is often used for input sample selection. But PCA is not always effective for nICA decomposition since the variance of a sample is not necessarily related to the importance of a sample.
Here, we propose to use mutual information (Vrins et al. 2003) to perform input sample selection. The objective is to select m' informative samples (v 1 ,...,v m ,) from a set of m samples (x 1 ,…, x m ), where m Ͼ m'. At each step of the algorithm, we choose a sample that is as statistically independent as possible (Hyvärinen et al. 2001) from the already selected samples v j, j = 1,…, k-1. In other words, x i is the k-th selected sample (i.e. v k ) if the following cost function f (i, k-1) (defi ned as the sum of mutual information) is minimized when i = l: where MI(.,.) denotes the mutual information that is defi ned as (Cover and Thomas, 1991): represents the entropy of a centered univariate random variable x i (v i ) and H(x i ,v j ) represents the joint entropy of two centered univariate random variable x i and v j (Bach and Jordan, 2003). Therefore, the selected subset (v 1 ,…,v m ,) will contain the samples that are mutually "quite different" as a result of the minimization of mutual information.

Stability-based dimension estimation
In practice, the number of independent components for nICA is often determined by the user's prior knowledge or obtained by PCA with a criterion of containing 95% of energy mainly to eliminate the noise effect (Hyvärinen et al. 2001). However, from our experience, it is often a diffi cult task in microarray data analysis to obtain a meaningful number of components by either the prior knowledge or PCA approach. When the number of components is incorrectly estimated, nICA will produce many possible false components for gene module identification. Hence, we propose to conduct stability analysis on gene expression data to estimate the number of components. Figure 2 shows the proposed stability-based schema, namely "splitting by samples", for reliable dimension estimation of nICA (Wang et al. 2007a).
The basic idea of the stability-based approach is that the nICA results from two data subsets sampled from a common data set should be consistent. The consistency (or similarity) of the nICA results from two non-overlapped subsets refl ects of the consistency between the assumed and underlying component numbers. More specifi cally, we split the samples into two non-overlapped subsets for nICA analysis and run the algorithm from i = 2 to full dimension of the subset of samples. We believe that if the dimension estimation really captures the underlying biological component number, the similarity score measured by mutual information between the components estimated from the two data subsets should give the best similarity score among all of the dimensions. When the estimated component number is not equal to the true number, the nICA results will show a tendency of mismatched components being estimated, hence, a decrease of similarity.
Due to the ambiguity of the scale in the nICA estimates, we need to normalize the estimated components and register them before calculating the similarity score. In our approach, we fi rst normalize Algorithm Ends the estimated components to be unit-variance variables. We then perform the registration (or alignment) of two permutated versions of components via an information theoretic approach. The exact way to align (or register) different pairs of components is by examining their mutual information. We calculate the similarity score after the alignment using averaged pair-wise mutual information: where MI(.,.) denotes the mutual information estimate as defi ned in Eq.
(2), ⎣ . ⎦ is the fl oor function, and s s ) is the i-th aligned pair of the components estimated from two different subsets. In order to obtain a reliable estimation of the dimension number, stability tests are performed P times independently (in our experimental design, we re-run the algorithm P = 100 times with random initialization), each time after a random shuffl ing to the order of samples. Finally, we choose the dimension with the largest similarity score averaged over P runs as the estimate of the component number.

Learning algorithm of nICA
We present a learning algorithm for nICA based on a latent model: where X denotes the microarray data matrix with rows corresponding to samples and columns N Y Split X into two non-overlapped subsets X1 and X2; each with round(m'/2) samples respectively.
Estimate the sources from X1 and X2 in every possible dimension number from 2 to round(m'/2).
Calculate the similarity scores by the averaged pair-wised mutual information.
Re-run the experiment 100 times?
Select the dimension with the largest similarity score as the estimated component number. corresponding to genes, S = (s 1 ,…,s n ) T represents the n independent biological processes, and A is the mixing matrix (matrix of contributions of each biological process). Suppose that s i (i = 1,…,n) has non-zero probability density function (pdf) all the way down to zero, then it has been proven (Oja and Plumbley, 2004;Plumbley, 2003;Plumbley, 2002) that we can fi nd Y = US where U is a square orthonormal rotation and permutation matrix. It is equivalent to say that the elements y i of Y are a permutation of components if and only if all y i are all non-negative. The above result can be used to derive a simple solution to the nICA problem. Since Y = US can also be written as Y = WZ with Z the pre-whitened observation matrix and W an un-known orthogonal (rotation) de-mixing matrix, it suffi ces to fi nd an orthogonal matrix W for which Y = WZ is non-negative. Therefore we can consider nICA as a procedure with the following two steps: 1) removing the second order statistics by whitening the data; 2) searching for a rotation matrix to make all the data fi t into the positive quadrant.
A learning algorithm to fi nd the de-mixing matrix W can be summarized as follows (Oja and Plumbley, 2004): 1) Pre-whiten the observed data X by the whitening matrix V: 2) Defi ne the cost function J as: and use a gradient descent algorithm to minimize the cost function J in Eq. (6): where γ is the step size.
3) Project the unconstraint gradient descent set onto a set of orthonormal vectors:

VISDA clustering
After performing nICA, we obtain the independent components representing some distinct biological processes. In these putative biological processes, the genes showing relatively high or low expression levels are most interesting. The analysis of gene patterns that are signifi cantly over-or under-expressed in the components may provide insights into the biological events associated with these latent processes. We fi rst use a pre-screening procedure to single out these genes and then apply VISDA to analyze the gene patterns in the latent space. In the pre-screening procedure, we fi rst sort the genes by their contributions (or loads) in each component, which creates a natural ordination in which genes are arranged based on their association with a given component. Then we select a subset of genes within each component, i.e. the over-expressed genes or under-expressed genes according to the value of each gene in the component (Lee and Batzoglou, 2003). By taking the union of the selected genes from each component, we form a pool of genes that we believe are closely related to the biological processes revealed by nICA. We then employ VISDA, a statistical model based clustering tool, to perform gene clustering on those selected genes in the latent space. Based on a hierarchical standard fi nite normal mixture (SFNM) model, VISDA captures the coherent structures in the latent space and performs topdown divisive clustering. The fi tting process of the SFNM model is achieved by the Expectation Maximization (EM) algorithm (Wang et al. 2007b), which maximizes the likelihood function. For each cluster at a level of the hierarchy, VISDA uses two different projection methods, principle component analysis (PCA) and PCA-projection pursuit (PCA-PPM) (Wang et al. 2007b), to visualize the sub-clusters within the clusters. The user chooses one of the projections that he/she thinks better revealing the data structures. On the chosen projection, user initializes models with different number of clusters by clicking on the computer screen at the centers of the clusters. These two-dimensional (2-D) models will be refi ned by the EM algorithm and compete according to Minimum Description Length (MDL) criterion or human justifi cation. The winning model in 2-D space will be transferred back to original data space to initialize the data model in that space. Then the EM algorithm in original data space will refi ne the model and obtain the partition of data at that level. At the top level, the whole dataset is split into several coarse clusters that may contain multiple functional modules; at lower levels, these coarse clusters are further decomposed into fi ner clusters, until no substructures can be found.

Results
We applied the proposed nICA approach to identify gene functional modules from the following three expression datasets: Dataset 1 -budding yeast during cell cycle CLB 2 /CLN 3 overactive stain (Spellman et al. 1998), consisting of spotted array measurements of 6,178 genes in 77 time points; Dataset 2 -yeast in various stressful conditions consisting of spotted array measurements of 6,152 genes in 173 experiments (Gasch et al. 2000); Dataset 3 -a 27-time points muscle regeneration series in vivo murine regeneration with Affymetrix oligonucleotide array measurements of 7,570 genes (Zhao et al. 2003). To determine whether the proposed approach can uncover the gene modules from gene expression data in the latent space, we mainly used the Biological Network Gene Ontology (BiNGO) tool (Maere et al. 2005) to evaluate the enrichment of functional annotations, and the Ingenuity Pathway Analysis (IPA) to assess the regulatory networks associated with the gene sets obtained by nICA.

Yeast cell cycle data
The yeast cell-cycle dataset was preprocessed to obtain log-ratios between red and green intensities, i.e. x i j = log 2 (r ij /g ij ). Since the data set contains both positive and negative log-ratio values, we need to perform data pre-treatment for nICA. We assume that distinct regulatory interactions are responsible for up-regulation versus down-regulation of gene expression. With the spirit of "divide and conquer", we split the data into two parts -positive and negative values corresponding to up-and down-regulated gene sets respectively -to fi t the nICA model.
Firstly, to prevent the over-learning problem the dimension of the data set was reduced using the input sample selection procedure described in the Methods section. We used the mutual information quality index, f (i,k-1) as in Eq. (1), to evaluate the samples for the most suitable number of inputs. Figure 3 shows the sum of mutual information measured for all the input samples in the positive part of the data. As we can see, there is an apparent increase at the dimension of 66. Therefore, we selected 65 samples for the positive part and 62 for the negative part (the fi gure is not shown here) for further nICA analysis. Secondly, we used the stability-based dimension estimation method to estimate the number of independent components. The results of stability analysis are shown in Figure 4, and an apparent peak is obtained from the averaged pairwise mutual information when the number of components is equal to 3. Thirdly, we applied the nICA learning algorithm to uncover the independent components for gene module identifi cation. Finally, we obtained the gene modules by gene clustering using VISDA in the latent space. The four most signifi cant gene clusters are given in Table 1. We measured the biological signifi cance of each cluster using the BiNGO tool. The p-value of each cluster was calculated according to its overlap with the functional annotations in Gene Ontology (GO) (see (Maere et al. 2005) for the detail).

Yeast dataset
Yeast dataset, which exhibits highly coordinated metabolic fl uctuations, gene expression patterns and cell division cycles, was cultured under diverse experimental conditions, temperature shocks, amino acid starvation, and progression into stationary phase (Gasch et al. 2000). This dataset has been extensively studied because of its importance in a variety of biotechnological applications. As in (Lee and Batzoglou, 2003), we also used KNNimpute to fi ll in the missing values (Troyanskaya et al. 2001). And due to the triviality of clustering environmental stress response (ESR) genes defi ned by (Gasch et al. 2000), we eliminated them in our analysis. The fi nal data set contains 5,284 genes and 173 samples.
In this case study, we focus on comparing the result from the nICA approach,which is enforced by the non-negativity constraint, with that from conventional ICA approach (Lee and Batzoglou, 2003) and NMF (another non-negative matrix decomposition approach) (Kim and Tidor, 2003;Lee and Seung, 1999). To objectively evaluate the clustering results from different methods, we used the z-score introduced in (Gibbons and Roth, 2002) to conduct a comparative study. As described in    The four most signifi cant clusters from nICA for the cell cycle data set. Numbers in parentheses in the fi fth column show the percentage of genes within the cluster that are presented in one of the functional category. And the numbers in the sixth column are presented in the similar way which corresponds to the total number within the whole genome set that are annotated with one of the special categories in GO system.  (Gibbons and Roth, 2002), the z-core is based on the mutual information between clustering results and the gene annotation. The higher scores indicate clustering results more biologically signifi cant. We compared the clustering results of nICA, NMF and ICA from small to larger cluster numbers and the z-scores are shown in Figure 5. As we can see from Figure 5, nICA consistently outperformed ICA with an average increase of z-score of 10. It is interesting to observe that the NMF performed slightly better that nICA when the number of cluster is small and nICA performed slightly better than NMF when the number of cluster becomes large. In our opinion, the overall performances of nICA and NMF are comparable. Figure 6 shows the scatter plots of the first three independent components from nICA and ICA, respectively. Since the process-specifi c genes are highly biased onto two orthogonal axes respectively shown in each sub-panel of Figure 6, we conclude that, comparing with ICA, nICA is quite effective in extracting non-negative independent biological processes. Finally, in Table 2, we list fi ve of the identifi ed co-regulated gene groups that show signifi cant enrichment in GO term categories.

Muscle regeneration data
We further applied the nICA approach to a time course microarray data set from a profi ling study of in vivo murine muscle regeneration. Staged skeletal muscle degeneration/regeneration was induced by injection of cardiotoxin (CTX) as described (Zhao et al. 2003). Mice were injected in gastrocnemius muscles of both sides, and then sacrifi ced at the following 27 time points: 0h(our), 12h, 1d(ay), 2d, 3d, 3.5d, 4d, 4.5d, 5d, 5.5d, 6d, 6.5d, 7d, 7.5d, 8d, 8.5d, 9d, 9.5d, 10d, 11d, 12d, 13d, 14d, 16d, 20d, 30d, and 40d (Zhao et al. 2003). Expression profi les were obtained with Affymetrix's U74Av2 and MAS 5.0 summarization algorithm. As a preprocessing step, we used the last time point as the reference point and the expression matrix consists of log-ratios of the expression measurements with respect to the reference point.
We then applied the nICA approach to the positive and negative parts respectively for gene module identifi cation. As a result, we found 11 clusters from the positive part of the data and 9 clusters from the negative part; all with significant biological coherence. Several clusters showed an expression pattern highly correlated with MyoD1 gene (Figure 7 shows an example of the heatmap of cluster 8 from the positive part of the data). MyoD1 has been widely studied for its important function in embryonic myogenesis and postnatal muscle regeneration. We also examined the biological relevance of these clusters. The results are shown in Table 3 and Table 4 (p-value less than 10 −4 is considered as signifi cant).
With the identifi ed gene clusters, we further used the Ingenuity Pathway Analysis (IPA) [10] to assess their biological plausibility, with respect to known information about gene regulatory networks, pathways and module functions. Among them, we found two clusters whose network functions are tightly related to skeletal and muscular system development (Fig. 8). Moreover, the cluster 13 found in the negative part contains Rb1 and cluster 1 found in the negative part contains MyoD1, which are two novel downstream targets of MyoD (Bakay et al. 2006).

Conclusion
This paper presents a new gene clustering approach, namely nICA-based approach, for composite gene module discovery. A complete algorithm of the nICA approach has been developed with the following main components: (1) input sample selection, (2) stability-based dimension estimation, (3) nICA learning algorithm and (4) gene clustering by VISDA. By projecting the gene expression data onto nICA space, co -regulation structures of the  modules can be revealed and highlighted. Using a pre-screening and VISDA clustering procedure, we can identify biological process enriched clusters with coherent functional annotations. The experimental results on the yeast data sets have demonstrated the advantages of the nICA approach over conventional ICA-based approach. The results also indicated that the performances of nICA and NMF are comparable. Further, the nICA approach has been applied to a muscle regeneration data set for novel gene module discovery. The results have shown that not only the identifi ed gene modules are biologically signifi cant and plausible, but novel downstream target genes can also be discovered by the nICA approach.     Figure 8. The top panel shows the cluster 6 found in the positive part of the muscle regeneration data with the following functions: posttranslational modifi cation, cellular growth and proliferation, and skeletal and muscular system development; The bottom panel shows the cluster 6 found in the negative part of the data with the main function of skeletal and muscular system development.
anonymous reviewers for their invaluable suggestions on the improvement of our manuscript.